Josef Fruehwald
Quick intro to the LAP data
Overview of the contemporary sociophonetics workflow
The unique issues posed by the LAP data
Initial approach to addressing the issues
This first portion of the diagram is the most time-intensive part of the process, after fieldwork is over and before analysis begins.
The best-case scenario is 10 hours of transcription for every 1 hour of audio.
| LANCS Audio | ~177 hours |
| Total transcription time | 1,770 to 2,700 hours |
| Time to transcription (1 RA @ 15 hr/wk) | 2.5 to 3.5 years |
| Cost of transcription (@ $15/hr) | $26,500 to $40,500 |
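As a back-of-the-envelope check, the time and cost rows follow from the total transcription hours (the table's lower cost figure appears to be rounded):

```python
# back-of-the-envelope check on the transcription estimates above
low_hours, high_hours = 1770, 2700   # total transcription hours

# one RA working 15 hours a week, 52 weeks a year
years_low = low_hours / (15 * 52)    # roughly 2.3 years
years_high = high_hours / (15 * 52)  # roughly 3.5 years

# paid at $15 per hour
cost_low = low_hours * 15            # $26,550
cost_high = high_hours * 15          # $40,500
```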
Replace this
With this
Initial experiments fine-tuning a pretrained wav2vec2 model on 3.5 hours of PNC data resulted in:
eval word error rate = 0.34
eval character error rate = 0.189
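For reference, word error rate is the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the length of the reference; character error rate is the same computed over characters. A minimal sketch (made-up example sentence; real evaluations typically use a library like jiwer):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

On a made-up pair like "would you describe the fireplace please" vs. "would you describe the fire place please", `wer()` comes out to 2/6: one substitution plus one insertion against a six-word reference.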
All ASR systems are trained on labelled audio. When properties of the training audio and the use-case audio differ substantially, the system may not perform well. Mismatches include:
Different kinds of speech (Tatman & Kasten 2017; Wassink & Gansen & Bartholomew 2022)
Different kinds of recordings
An example of training audio

which circumstances do not permit him to employ
(source: LibriSpeech (Panayotov et al. 2015))
An example of LANCS audio

well, uh, you mean monday, tuesday, wednesday, thursday, friday, saturday
Pre-processing LAP Audio
To the extent there are consistent issues across the LANCS data, we can develop pre-processing workflows for them.
A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide
A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide
well, uh, you mean monday, tuesday, wednesday, thursday, friday, saturday
Audio metadata
Most of the processing shown here was done with the librosa library in Python.
Time stamps for separate sessions were recorded in a yaml file
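For illustration, a session time stamp file could look like the following and be read with PyYAML; the field names here are hypothetical, not the project's actual schema:

```python
import yaml  # PyYAML

# hypothetical example of a session time stamp file
yaml_text = """
sessions:
  - name: fireplace
    start: 192.0   # seconds into the recording
    end: 2510.0
  - name: weekday
    start: 2530.0
    end: 4500.0
"""

metadata = yaml.safe_load(yaml_text)
first = metadata["sessions"][0]
```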
To deal with the low-frequency hum, preemphasis and a high-pass filter were applied to the audio.
import librosa
import scipy.signal

# loading the audio
y_fireplace, sr = librosa.load("assets/fireplace_short.wav", sr = 16000)
# default librosa preemphasis
y_fireplace2 = librosa.effects.preemphasis(y_fireplace)
# getting parameters for the highpass filter
b, a = scipy.signal.butter(N = 1,               # a fairly gradual slope
                           Wn = 180,            # critical frequency at 60*3
                           btype = "highpass",  # highpass filter
                           fs = 16000,          # sampling rate
                           output = "ba")       # kind of output
# the actual filtering
y_fireplace2 = scipy.signal.filtfilt(b = b, a = a, x = y_fireplace2)

Before:
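The pipeline later in these slides calls these filter steps through a highpass() helper; a minimal sketch of what that wrapper could look like, with the same defaults as above (the project's exact implementation may differ):

```python
import numpy as np
import scipy.signal

def highpass(y, sr = 16000, wn = 180, order = 1):
    """Zero-phase Butterworth high-pass filter (defaults as in the slides)."""
    b, a = scipy.signal.butter(N = order, Wn = wn, btype = "highpass",
                               fs = sr, output = "ba")
    # filtfilt runs the filter forward and backward, so no phase shift
    return scipy.signal.filtfilt(b = b, a = a, x = y)
```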

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide
After:

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide
The librosa package implements a method for decomposing audio into its percussive vs. harmonic components, originally intended to separate drum tracks from melodies (FitzGerald 2010; Driedger 2014).
# short-time fourier transform
D = librosa.stft(y_fireplace2, n_fft = 2048, win_length = 512, hop_length = 512//4)
# decomposition into harmonic and percussive
D_harm, D_perc = librosa.decompose.hpss(D, margin = 3)
# capture residual component
D_resid = D - (D_harm + D_perc)
# separating magnitude from phase
D_perc_m, D_perc_p = librosa.magphase(D_perc)
# converting to dB
D_perc_db = librosa.amplitude_to_db(D_perc_m)
# subtracting 20 dB from the percussive component
D_perc_db = D_perc_db - 20
# back to amplitude
D_perc_new_m = librosa.db_to_amplitude(D_perc_db)
# recombining with phase
D_perc_new = D_perc_new_m * D_perc_p
# adding it all together
new_D = D_harm + D_perc_new + D_resid
# back to signal
y_fireplace3 = librosa.istft(new_D, n_fft = 2048, win_length = 512, hop_length = 512//4)
# re-normalize the output
y_fireplace3 = librosa.util.normalize(y_fireplace3)

def dampen_hit(y,
               sr = 16000,
               n_fft = 2048,
               win_length = 512,
               hop_length = 512//4,
               margin = 3,
               by_db = 20):
    """
    Using harmonic/percussive decomposition, dampen mic hits.
    """
    # short-time fourier transform
    D = librosa.stft(y, n_fft = n_fft, win_length = win_length, hop_length = hop_length)
    # decomposition into harmonic and percussive
    D_harm, D_perc = librosa.decompose.hpss(D, margin = margin)
    # capture residual component
    D_resid = D - (D_harm + D_perc)
    # separating magnitude from phase
    D_perc_m, D_perc_p = librosa.magphase(D_perc)
    # converting to dB
    D_perc_db = librosa.amplitude_to_db(D_perc_m)
    # subtracting by_db dB from the percussive component
    D_perc_db = D_perc_db - by_db
    # back to amplitude
    D_perc_new_m = librosa.db_to_amplitude(D_perc_db)
    # recombining with phase
    D_perc_new = D_perc_new_m * D_perc_p
    # adding it all together
    new_D = D_harm + D_perc_new + D_resid
    # back to signal
    out_signal = librosa.istft(new_D, n_fft = n_fft, win_length = win_length, hop_length = hop_length)
    # re-normalize the output
    out_signal = librosa.util.normalize(out_signal)
    return out_signal

Before

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide
After

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide
There are a few methods out there for noise reduction, including “Per-Channel Energy Normalization” (Wang et al. n.d.; Lostanlen et al. 2019) and “Spectral Gating” (Sainburg & Thielk & Gentner 2020; Sainburg et al. 2022). I’ve found spectral gating to give better results for the final audio.
You start off with a spectrogram
Then, you smear it out across the time domain. Since I’m dealing with pretty stable, consistent noise, I’ve chosen a long window for smearing (3 seconds).
See how much above background the signal is…

…convert that into a multiplier between 0 and 1

Soften out the edges a bit more

Multiply it by the original spectrogram
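The steps above can be sketched in a few lines. This is a toy illustration, not the noisereduce package's actual implementation; the parameter names here are my own:

```python
import numpy as np
from scipy.signal import stft, istft
from scipy.ndimage import uniform_filter1d

def spectral_gate(y, sr = 16000, nperseg = 512, time_constant_s = 3.0,
                  thresh_mult = 3.0, slope = 3.5):
    """Toy spectral gating: attenuate spectrogram bins near the noise floor."""
    # start off with a spectrogram
    _, _, D = stft(y, fs = sr, nperseg = nperseg)
    mag = np.abs(D)
    # smear it out across the time domain (3 second window by default)
    hop = nperseg // 2  # scipy's default 50% overlap
    frames = max(1, int(time_constant_s * sr / hop))
    background = uniform_filter1d(mag, size = frames, axis = 1)
    # see how much above background the signal is
    ratio = mag / (background + 1e-10)
    # convert that into a multiplier between 0 and 1; the sigmoid
    # softens the edges of the mask
    mask = 1.0 / (1.0 + np.exp(-slope * (ratio - thresh_mult)))
    # multiply it by the original (complex) spectrogram and invert
    _, y_out = istft(D * mask, fs = sr, nperseg = nperseg)
    return y_out
```

Note that this nonstationary version treats anything that sits near its own time-smeared background as noise, so a perfectly stationary signal is also attenuated; the noisereduce package exposes separate stationary and nonstationary modes.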

before

after

The Very Start

The end

well, uh, you mean monday, tuesday, wednesday, thursday, friday, saturday
from noisereduce import reduce_noise

y_week, sr = librosa.load("assets/weekday_short.wav", sr = 16000)
y_week2 = librosa.effects.preemphasis(y_week)
y_week3 = highpass(y = y_week2)
y_week4 = librosa.util.normalize(dampen_hit(y_week3))
y_week5 = librosa.effects.deemphasis(y_week4)
y_week6 = reduce_noise(y_week5,
                       sr = 16000,
                       n_fft = 2048,
                       win_length = 512,
                       hop_length = 512//4,
                       time_constant_s = 3,
                       thresh_n_mult_nonstationary = 3,
                       sigmoid_slope_nonstationary = 3.5,
                       freq_mask_smooth_hz = 500,
                       time_mask_smooth_ms = 100)
y_week7 = librosa.util.normalize(y_week6)

well, uh, you mean monday, tuesday, wednesday, thursday, friday, saturday
Run conda env create -f audioprocess.yml to install all the dependencies, then conda activate audioprocess to activate the environment.

It might be possible to speed up the diarization process by

IVR: I have a number of uh things I'd like to ask you about, I wonder if you wouldn't mind answering questions one after another
KY25A: Yeah well, you might start that I was born in 1867